Wavelets  for  regression  smoothing 


Wavelet  And  Multifractal  Analysis  2004 

Summer  School 
Corsica,  July  19  -  31,  2004 

Anestis  Antoniadis 
University  Joseph  Fourier 


.  -  p.l/55 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 

1.  REPORT  DATE 

07  JAN  2005 

2.  REPORT  TYPE 

N/A 

3.  DATES  COVERED 

4.  TITLE  AND  SUBTITLE 

Wavelets  for  regression  smoothing 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

University  Joseph  Fourier 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release,  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

See  also  ADM001750,  Wavelets  and  Multifractal  Analysis  (WAMA)  Workshop  held  on  19-31  July  2004., 

The  original  document  contains  color  images. 

14.  ABSTRACT 

15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

uu 

18.  NUMBER 
OF  PAGES 

55 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


Outline 


Generalities  on  nonparametric  regression 

Penalized  regression  splines 

Penalized  wavelet  estimation  with  quadratic 
penalties 

Other  penalties  and  model  selection 
Extensions 


Generalities 


Regression  models  describe  the  dependence  of  the 
response  variable  of  interest  Y  on  one  or  more  predictor 
variables  X. 

A  basic  analysis  starts  with  a  random  sample  of  size  n 
from  the  distribution  of  (X,Y)  where  the  conditional 
mean  and  variance  of  the  zth  response  Yt  are  given  by 

E  (Yj/X  =  Xj)  =  fj(xi)  Var  (Yf/X  =  Xj)  =  cr2(Xi)  =  o2. 

The  parameter  a  is  a  scale  parameter  which  is  assumed 
to  be  constant  in  what  follows. 
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Nonparametric  methods 


Nonparametric  models  are  those  models  where  the  form 
of  the  conditional  expectations  is  dictated  by  the  data. 
Two  approaches: 

3  Fit  simple  parametric  models  locally,  e.g.,  moving 
averages  or  local  polynomial  estimation. 

3  Fit  a  highly  complex  parametric  model  with  a 

complexity  penalty  to  prevent  "overfitting".  Roughly, 
"overfitting"  means  fitting  the  noise,  not  the  signal,  so 
that  features  are  detected  that  would  not  be  present  in 
an  independent  replicate  of  the  data. 


Both  approaches  require  a  "smoothing  parameter." 
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Spline  based  methods 


We  will  focus  on  nonparametric  penalized  regression 
methods  involving  the  use  of  basis  functions  and  quadratic 
penalties,  pointing  out  some  basic  principles  they  have  in 
common  with  splines  when  fitting  regular  curves  and 
wavelets  when  fitting  less  regular  curves.  More  precisely, 
we  will  consider  the  estimation  of  t](x)  through  the 
minimization  of 

zn(v)  =  lit(yi-v(xi))2  +  V(ti),  (i) 

I L  •  -| 

i=l 

where  /( • )  is  a  roughness  functional.  The  parameter  A  con¬ 
trols  the  trade-off  between  lack  of  fit  of  rj  and  roughne^/55 


To  estimate  r]  ( • )  in  a  flexible  manner  we  represent  it  as  a 

linear  combination  of  known  basis  functions 
{h,  k  =  l,...,K}, 


K 

v(x)  =  E  PM*), 

k= 1 

and  then  try  to  estimate  the  coefficients 

P  =  (Pi,---,Pk)T- 

Usually  the  number  K  of  basis  functions  used  in  the 
representation  of  i]  should  be  large  in  order  to  give  a 
fairly  flexible  way  for  approximating  t / . 
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Popular  examples  of  such  basis  functions  %  are  wavelets 
and  polynomial  splines. 

A  crucial  problem  is  the  choice  of  the  number  K  of  basis 
functions.  A  small  K  may  result  in  a  function  space 
which  is  not  flexible  enough  to  capture  the  variability  of 
the  data,  while  a  large  number  of  basis  functions  may 
lead  to  serious  overfitting. 

Traditional  ways  of  "smoothing"  are  through  basis 
selection  see  e.g.  Friedman  and  Silverman  (1989), 
Friedman  (1991)  and  Stone  et  al.  (1997)  or  regularization. 
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For  x  given,  let  H  be  the  matrix  whose  columns  are 
hk{xi),  for  k  =  1, . . .  ,K.  We  have  rj{x)  =  H{x) jS.  Denote 

Ly(/5)  =  E  (y<  -  ■ 

Z  =  1 


Then  (??)  becomes 

/  I' 
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Power  basis 


Polynomial  regression  splines  are  continuous  piecewise 
polynomial  functions  where  the  definition  of  the 
function  changes  at  a  collection  of  knot  points,  which  we 
write  as  fi  <  •  •  •  <  fjc- 

Using  the  notation  z+  =  max(0,z),  then,  for  an  integer 
p  >  1,  the  truncated  power  basis  for  polynomial  of 
degree  p  regression  splines  with  knots  t\  <  ■  ■  ■  <  is 

{1,  X,  . . . ,  xp ,  (x  —  ti)  +  ,  . . . ,  (x  —  tKf+}. 

?](x,p)  =  H(x)jS  is  a  pth  degree  polynomial  on  each 
interval  between  knots  and  has  p  —  1  continuous 
derivatives  everywhere.  Refer  to  Eilers  and  Marx  (1996) 
and  Ruppert  and  Carroll  (2000).  , 


Example  of  Regression  Splines 


Basis  functions  of  piecewise  linear  spline 


Basis  functions  multiplied  by  coefficients 


Spline 


Illustration  of  a  Truncated  Power  Basis  construction. 
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P-splines 


/V 

Let  {$  minimize 

PSSU})  =  LyU i)  +  AEf=i^+; 

This  penalized  least-squares  estimator  is  called  a 
P-spline  (term  due  to  Eilers  and  Marx  (1996)).  Other 
quadratic  penalties  on  are  possible. 

3  In  particular,  one  can  find  B  so  that 

3  The  choice  of  spline  basis  is  unrestricted:  Truncated 
power  functions  are  not  needed,  though  we  find 
them  convenient.  A  natural  cubic  spline  basis  could 
be  used,  so  that  cubic  smoothing  splines  are  a 
special  case  of  P-splines. 
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Smoothing  parameter 


Here  A  >  0  is  a  penalty  or  regularization  weight.  As 
A  — >  oo  ,  the  penalized  spline  converges  to  a  pth  degree 
polynomial  fit.  As  A  — >  oo,  the  penalized  spline 
converges  to  the  OLS  fitted  spline. 

The  coefficient  \ 6„+j  is  the  jump  in  the  pth  derivative  at 
the  knot  l-,.  Thus,  the  penalty  is  on  the  (p  +  1 ) th 
derivative,  interpreted  as  a  generalized  function. 


.  -  p.12/55 


P-splines  -  cont. 


Let  y  =  (; 1/1 ,  •  •  •  ,yn)  arid  D  =  diag(0, . . .  ,0,1, . . .  ,1) 
(p  +  l  zeros  and  K  ones).  Then 


PSS(jS)  =  ||y  —  HjS||2  +  A||D/S 


Since  j8(A)  solves  dPSS( jS)/3jS  =  0  we  have 


jS(A)  =  (HrH  +  AD)  1HTy. 


-IttT, 


(Ridge  regression) 
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Cross  Validation  (CV) 


Goal:  Choose  A. 

y\ 

Yj  ( x,  j6(A) )  is  the  fitted  spline  with  penalty  weight  A. 

yv 

Y]-i(x,  /2(A))  is  the  fitted  spline  with  penalty  weight  A 
and  without  using  (x;/y;  ). 

CV  (A)  =  J^{yi-rj-i(xi,H  A))} 

1  =  1 

Principle:  Choose  A  that  minimizes  CV  (A). 

Problem:  CV  is  computationally  intensive. 

Solution:  Use  generalized  cross  validation  (GCV),  an  ap¬ 
proximation  to  CV. 
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Generalized  cross  validation 


The  Smoother  or  Hat  Matrix 


S( A)  :=  H(HtH  +  \D)~lH 


—  1  ttT 


Degrees  of  freedom:  DF( A)  =  tr(S(A)) 
GCV: 


S(A)y 


GCV(A)  = 


(1  —  DF(A) /n)2 


So  use  A  that  minimizes  GCV  (A)  and  define  jS  =  p(A) 
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Penalized  wavelet  regression 


We  will  now  be  concerned  with  developing  a 
counterpart  to  the  spline  smoothing  technique  for  the 
case  of  fitting  less  regular  curves. 

Several  variants  of  this  penalized  approach  have  been 
suggested  by  several  authors  (Devore  and  Lucier  (1992), 
Antoniadis  (96),  Amato  and  Vuza  (97),  Dechevsky  and 
Penev  (99). 

The  regularity  of  the  curve  is  discussed  in  terms  of  the 
size  of  its  norm  in  a  Sobolev  space  with  a  relatively  small 
value  of  the  positive  smoothness  index. 

We  will  also  study  a  specific  version  of  generalized  full 

cross  validation  for  the  choice  of  the  smoothing 
parameter. 
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Setup  and  assumptions 


Considering  the  structure  of  the  wavelet  multiresolution, 
the  design  points  Xj  will  be  assumed  to  have  the  form 

X\  =  z’A,  with  A  =  2~n  for  z  =  0, . . . ,  n  —  1,  where 
N  =  log2  n,  n  being  the  sample  size. 

For  simplicity  of  presentation  we  shall  work  with 
periodic  wavelets  on  the  interval  on  [0, 1]. 

Functions  cp  and  ip  will  respectively  denote  the  scaling 
function  and  the  wavelet  associated  to  a  periodic 

(/-regular  multiresolution  analysis  of  L2(  [0, 1  [). 
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Penalized  wavelet  regression  -  cont. 


We  will  assume  that  r]  e  Hv ( [0, 1  [),  where  HV(HV ( [0, 1  [) 
denotes  the  Sobolev  space  of  order  v  {v  >  1/2). 

Assume  further  that  the  scaling  function  (p  is  a  coiflet  of 
order  L  with  L  >  [v]  + 1. 

The  nested  structure  of  a  multiresolution  analysis  leads 
to  an  efficient  tree-structured  algorithm  for  the 
decomposition  of  functions  in  for  which  the 
coefficients  (rj,  (pNj()  are  given. 


When  a  function  is  given  in  sampled  form  there  is  no  gen¬ 


eral  method  for  deriving  the  coefficients  (rj,  (pN,k)-  We  will 
approximate  the  projection  Py  by  some  operator  II ^  in 
terms  of  the  sampled  values  rj  ( ^ ) .  _  p  18/5, 


Since  the  coiflets  have  L  vanishing  moments,  we  have: 

The  set  of  non  zero  coefficients  (k)  —  2 N^2{t],  <pN,k)  has  a 

cardinality  equivalent  to  0(n).  Moreover ;  with  L  >  [v]  + 1, 
the  following  uniform  (in  0  <  k  <  2N  —  1)  bound  holds: 

l«lN>(*)-’/(^)l<Ci2-Nl' 

where  Ci  is  a  constant  only  depending  on  the  coiflet  (p. 

One  is  therefore  able  to  approximate  the  coefficients 
(q,  (pN/k)  with  an  error  0(2_N/22_Nl/). 
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It  is  now  natural  to  approximate  PyN  >]  by 


(nN*/)(0 


Using  such  an  approximation  we  have: 

l|Pv'N7-nN>;|U<0(2-N’') 

Observing  that  IE ( Y. )  —  i/(i )  justifies  completely 
replacing  the  original  data  by  the  “raw"  function 

2n_i 

fjN{t)  =  2~N/1  Y  y^N,fc(f) 

k= 0 
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For  any  r]  e  Hv  ( [0, 1] ),  fl^f/  can  be  expanded  as 


2*)-l  N  21-1 

=  YY  ah,k<pj0rk  +  T.  T.  djjS’jjx- 

k= 0  /=/o  fc=0 


The  orthogonal  discrete  wavelet  transform  on  the  in¬ 
terval,  say  W,  applied  on  identified  by  the  vec¬ 
tor  y  of  observed  values,  gives  a  new  sequence 
of  numbers  {  (vj0/k) 


k=0,...,2>0-l 


IV  jk I  }/ 


say  ( (Vj'k),  (wj'k))  for  short,  that  are  interpreted  as  the  co¬ 
efficients  of  the  expansion  of  the  function 
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Under  the  noise  model,  noise  contaminates  all  wavelet 
coefficients  of  n equally. 

Consequently,  the  empirical  wavelet  coefficients 
Uvj'k),  ( wl/k )  j ,  obtained  by  applying  the  discrete 

orthogonal  transform  on  the  data  vector  2-N/2y,  can  be 
considered  as  a  version  of  the  coefficients  na^),  ( djj( )  j 
of  UNt],  contaminated  by  a  similar  white  noise  €;^  of 
variance  cr2/2N. 
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The  regularization  problem  is  now  given  by 


inl\  \\vn-v\\2  +  M\pv^v 

rjeH  I  70 


2 

n 


and  using  the  equivalent  sequence  norms  and  these 
expression,  minimizing  the  penalized  functional  given 
above  is  equivalent  to  minimize  the  expression 


2/0-1 


E 


N  2/  -1 

+  E  E 

j=jo  k=0 


( 


W 


j,k 


dj,k)2  +  A22v>d?jfc 


where  W;^  =  0  for  j  >  N  and  k  =  0, ...  ,21  —  1. 
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We  thus  have  ajQ/k  =  VjQik,  k  —  0, . . . ,  2/0  —  1  and 


Wjjt 


dkk  i  +  A22  vj 


j  >jo,k  =  0,..., 2-i  -  1 , 


that  is: 


VA  =  LZo  1  Vjo,k<t>j0,k  +  Lf=jo  LfJ  djrklpj/k , 


a  tapered  wavelet  series  estimator  of  rj,  i.e.  fj,\  may  be 
viewed  as  the  result  of  passing  the  "raw"  wavelet  series 
estimate  fj jv  through  a  low  pass  filter  controlled  by  the 
parameters  A  and  v.  We  see  that  the  resulting  estimator 
is  a  linear  estimator  of  shrinkage  type.  Shrinking  is  level 
dependent.  Up  to  level  jo  there  is  no  shrinking;  shrinking 
of  the  wavelet  coefficients  is  heavier  at  higher  levels. 
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Asymptotics 


Asymptotic  minimax  optimality  theory  (see  Delyon  and 
Juditsky  (1996))  allow  us  to  make  the  above  choice  for  jo 
and  N.  It  turns  out  that  jo  should  satisfy  an  inequality  of 

the  type  (Cn)1^2v+1')  <  2^°  <  2 (Cn)1^2v+1')  with  C  >  0 


suitably  chosen.  A  choice  2N  =  0{nj  log  n)  is  motivated 
by  the  observation  that  for  v  >  1/2  (when  Hv  contains 

only  continuous  functions),  both  djj, \  <  0(n-1/2)  and 


hold  simultaneously.  Hence,  for 


any  n  there  is  no  point  in  going  to  levels  of  j  with  2 J  >  n 

since  the  order  of  the  coefficient  estimated  is  smaller  than 
the  order  of  the  error  of  estimation.  The  logarithmic 

factor  in  the  choice  of  N  ensures  better  behaviour  of  the 
estimator  for  continuous  and  smooth  functions. 
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Asymptotics-cont. 


Let  us  introduce  the  mean  square  error 

ER(A)  -  E  {jE,"=i(fo(*t)  “  V(xi))2}  (the  quantity  we 
are  interested  in  controlling).  We  then  have 

I/A  =  O  (n-2v/^v+ !))  then  ER( A)  =  O  (n~2v/(2v+1A  . 

The  estimator  is  linear  in  the  observations.  Asymptotic 
minimax  optimality  theory  suggests  that  linear  wavelet 
estimators  may  in  general  be  outperformed  by  nonlinear 
ones  when  no  a  priori  information  is  available  for  the 
curve.  But  for  v  >  1/2  this  case  is  within  the  parameter 
range  where  linear  estimators  have  the  same  asymptotic 
rates  as  the  best  non-linear  ones. 


.  -  p.26/55 


Choosing  A 


Essentially  two  methods: 

M  One  is  generalized  cross-validation  used  by  Amato 
and  Vuza  (97),  Jansen  et  al.  (1997),  Dechevski  and 
Penev  (1999)  who  have  chosen  to  work  directly  with 
a  wavelet  analogue  of  the  cross-validation  formula 
for  smoothing  splines,  overcoming  some 
“compatibility  condition"  which  doesn't  hold  for 
the  wavelet  case. 

3  Another  method  (Antoniadis  (96))  completely 

avoids  the  compatibility  problem  but  the  price  to  be 
paid  for  this  is  that  the  estimator  depends  explicitly 
on  the  noise  variance.  To  deal  with  this,  an 
estimator  of  the  noise  variance  is  incorporated 

within  the  definition  of  the  estimator.  .  -  p.27/55 


GFCV  for  A 


,2/0—1 


a 


:o,k<Pjo,k  +  E/I/o  Efc=o  and 


]o 


'N 

7=/o 

'N  sr-  27-1 
7=/o 


k= 0  "joM-i^h’k  +  E/=/o  E^=o  with 


Denote  77 A  =  E/f=0 

V\  -Efc=o  « 

%M-i)  =  (n~  i)-1  Le=i,eji:i<l>j0jc(xe)ye and 
djM-i)  =  (n  -  I)”1  E”=w* 

Then,  paralleling  the  P-splines  case  the  typical  cross 
validation  functional  to  be  minimized  with  respect  to  A  is 


CV(A)  = 
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Compatibility  condition 


Reduction  in  computation  for  minimizing  CV  is 
achieved  if  the  "compatibility  condition"  holds.  First 
define  y ^ (A)  J/i/  •  •  •  >yi— hyi+i/  •  •  •  >yn)- 

Compatibility  means  that 

y  {— f)  (y^')  yA.{xif  j/i/  •  •  •  /  j/f-i/  j/(-z)  (/^-)/  j/f+i/  •  •  •  /  j/«) 

holds. 

Under  the  above  condition  the  cross-validation 
functional  can  be  expressed  in  terms  of  the  ordinary 

residuals: 


CV{  A)  = 


E(y<  -  ?/a(*/))2/(!  -  KiW)2,  hu{ A) 

I  L  •  ^ 

1=1 


dy{ 


Unfortunately,  for  a  shrinking  type  estimator  the  compat- 
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Compatibility  condition-cont. 


The  idea  is  to  opt  for  another  alternative  by  changing  the 

cross  validation  criterion  itself.  Note  that  standard  cross 
validation  is  defined  entirely  with  respect  to  samples  of 

size  (n  —  1).  One  can  adjust  it  for  samples  of  size  n  as 
suggested  by  Bunke  et  al.  (1993).  In  their  approach,  the 
value  of  \)i  is  replaced  by  ( Xj )  instead  of  leaving  it  out 

in  defining  the  prediction  of  the  z-th  design  point.  The 
resulting  functional  is  called  the  FCV  (full  cross 
validation)  functional : 

FCl/(A)  =  if;(yI-f)A(x,))2. 

n,ti 
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Compatibility  condition-cont. 


It  turns  out  that  under  the  condition  of  linearity  only,  one 
gets  in  terms  of  the  ordinary  residuals: 

FCV(A)  =  i  -  tjx(xi))2  ■  (1  +  h ,(A))2, 

l  L  • 

i=l 


where 


1 

n 


IN/  ^ 

L<Pi,k(xi)  +  E  E 1  ,  w-yfei cW2 

k  Ho  k  i  +  AZ 


Following  the  same  idea,  as  for  GCV  one  defines  the 
generalized  FCV  (GFCV)  by  replacing  ha( A)  by 

M  1  Yl’i’—l  h[i  (  A)  .  .  -p.31/55 


Asymptotics  again 


Denote  A  =  argminAE(GFCV(A))  and 

A*  =  argminAEF(A).  One  can  then  show  that 

Ifrje  £>2/2  Fl  B^/00  where  0  <  v\  <  v'  <  v  <  r,  if  jo  —  0(1), 
2n  =  0(n/ logn),  then  for  n  large  enough, 

A  and  A*  exist  and  I  1  as  n  — >  oo.  Moreover, 

E(GFCV(A))  ~  ER( A)  +  (72 

holds  for  large  n,  in  a  neighborhood  of  A*,  uniformly  in  A  >  0. 
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An  example 


-0.2  0.0  0.2  0.4  0.6  0.8  1.0 


Simulated  fit 

True  curve:  solid  line;  the  penalized  estimator:  dashed  line; 
soft  thresholded  estimator:  dotted  line,  n  =  1024,  jo  =  1, 

N  =  6,  cr2  =  0.2.  . - p.33/5 


Other  penalties 


The  wavelet  estimation  procedure  described  before  is 
controlled  by  a  quadratic  penalty  and  as  such  produces 
linear  estimates  that  have  good  rates  for  smooth 
functions  only  Moreover,  it  does  not  take  into  account 
that  most  functions  have  usually  a  sparse  wavelet 
representation. 

In  order  to  deal  with  such  problems  a  variety  of  other 
type  of  penalties  have  been  proposed  in  the  recent 
literature,  see  for  example  Solo  (1998),  McCoy  (1999), 
Moulin  and  Liu  (1999),  Beige  et  al.  (2000),  Antoniadis 
and  Fan  (2001)  to  cite  only  a  few. 
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In  Solo  (1998),  the  penalized  least-squares  with  an 
L\  penalty  is  modified  to  a  weighted  least  squares  in 

order  to  deal  with  correlated  noise  and  an  iterative 
algorithm  is  discussed  for  its  solution.  The  choice  of 

the  regularization  parameter  is  not  discussed. 

By  analogy  to  smoothing  splines,  E.  J.  McCoy  (1999) 
uses  a  penalty  function  which  simultaneously 
penalizes  the  residual  sum  of  squares  and  the 
second  derivative  of  the  estimator  at  the  design 
points.  For  a  given  regularization  parameter,  the 
solution  of  the  resulting  optimization  problem  is 
found  using  simulated  annealing,  but  there  is  no 
suggestion  on  a  possible  method  of  chosing  the 
smoothing  parameter  in  her  work.  p35/5 


In  Moulin  and  Liu  (1999),  the  soft  and  hard 
thresholded  estimators  appear  as  MAP  estimators  in 
the  context  of  Bayesian  estimation  under  zero-one 
loss,  with  generalized  Gaussian  densities  serving  as 
a  prior  distribution  for  the  wavelet  coefficients  (see 
also  Leoporini  &  Pesquet  (1998)). 

A  similar  approach  is  also  used  by  Beige  (2000)  in 
the  context  of  wavelet  domain  image  restoration. 
The  smoothing  parameter  in  Beige  (2000)  is  selected 
by  the  L-curve  criterion  (see  Hansen  and 
O'Leary  (1993)).  It  is  however  known  (see 
Vogel  (1996))  that  such  a  criterion  can  sometimes 
lead  to  nonconvergent  solutions  when  the  function 
to  be  recovered  presents  some  irregularities.  p36/5 


Setup 


Assume  again  that  the  design  points  are  t ,  =  i/2N  for 
i  =  0, n  —  1  and  N  =  log2  n. 

Let  //  be  the  underlying  regression  function  collected  at 
all  dyadic  points  {// 2) ,  i  =  0, ...  ,2^  —  1}. 

Apply  the  Wavelet  Transform  on  //:  6  —  Wij  and 
t]  —  WT0,  to  get  an 
Overparametrized  linear  model: 

Yn  =  WT0  +  e. 
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The  Wavelet  basis  on  which  t]  is  projected  is  chosen  by 
fixing  the  resolution  N.  The  estimate  of  0  and  therefore 
of  t]  is  recovered  by  penalized  least-squares 


2 


WT0 


+  Pa(|0| 

ielN 


) 


Since  W  is  an  orthonormal  matrix  this  is  equivalent  in 
minimizing  componentwize 


2  1J^(zi-ei)2  +  \J^p( \0i\), 

1  =  1  z‘>z‘o 


where  Zj  is  the  ith  row  of  z  =  WY n  ■ 
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General  types  of  Penalties 


Several  penalty  functions  have  been  used  in  the  literature. 


M  The  L2  penalty  p(6) 
regression. 


yields  a  ridge  type 


.•  The  Li  penalty  p(6)  =  |0|  results  in  soft  thresholding 
rule  introduced  by  Bickel  (1983)  and  used  by  Donoho 
and  Johnstone  (1994)  in  the  wavelet  setting. 

3  The  penalty  p\(\0\)  =  A2  —  (|0|  —  \)2I(\9\  <  A),  leads 
to  the  hard-thresholding  rule  (Antoniadis,  1997). 


The  mixture  penalty  p\(\6\)  =  A min( 1 6 \ ,  A) . 


More  generally,  the  Ln  (0  <  q  <  1)  penalty  leads  to 
bridge  regression  (see  Frank  and  Friedman  (1993))  .  -  p.39/55 


Conditions  on  p 


Usually,  the  penalty  function  p  is  chosen  to  be  symmetric 
and  increasing  on  [0,  +oo).  Furthermore,  p  can  be  convex 
or  non-convex,  smooth  or  non-smooth. 

In  the  wavelet  setting,  Antoniadis  and  Fan  (2001)  provide 
some  insights  into  how  to  choose  a  penalty  function.  A 
good  penalty  function  should  result  in 

3  unbiasedness, 

3  sparsity , 

3  stability. 
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Examples 


Penalty  function 

Convexity 

Smoothness  at  0 

Authors 

vm  =  ij3i 

yes 

p'(0+)  =  1 

(Rudin  1992) 

p(P)  =  w,  “  G  (0,1) 

no 

p'(0+)  =  oo 

(Saquib  1998) 

p(W  =  « I0l/(i +  «IPI) 

no 

p'(0+)  =  a 

(Geman  92,  95) 

p(0)  =  0,  p(P)  =  1,Vj8  7^  0 

no 

discontinuous 

Leclerc  1989 

pu 5)  =  ipr,  *  >  i 

yes 

yes 

Bouman  1993 

p(/3)  =  a]S2/(l  +  <xp2) 

no 

yes 

McClure  1987 

p(jS)  =  minja^2, 1} 

no 

yes 

Geman  1984 

P(P)  =  ^/a  +  /32 

yes 

yes 

Vogel  1987 

p(f$)  =  log(cosh(a/$)) 

yes 

yes 

Green  1990 

B2/2  if  161  <  a, 

Ptf)={  '  ,,  f 

yes 

yes 

Huber  1990 

|  a  |/5|  —  a2/2  if  |jS|  >  a. 

Examples  of  penalty  functions 
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Discussion 


Some  neccessary  conditions  for  unbiasedness,  sparsity 
and  stability  have  been  derived  by  Antoniadis  and  Fan 
(2001). 


unbiasedness  <->■ 
sparsity 
stability 


p(\0\)  =  0  for  large  1 9 
6\  +  \p(\0\)  >  0 
argminj  \fi\  +  Xp(\6\)}  =  0 


From  the  above,  a  penalty  satisfying  the  conditions  on 
sparsity  and  stability  must  be  non-smooth  at  0.  In  the 
two  extremes,  the  hard  thresholding  rule  is 
discontinuous  (instable),  while  the  soft  thresholding  rule 
shifts  the  estimator  by  an  amount  of  A  even  when  |  z(  | 
stands  way  out  of  noise  level,  which  creates  unnecessary 
bias  when  6  is  large.  p42/ 


SCAD  penalty 


To  ameliorate  these  drawbacks,  we  may  use  the  SCAD 
penalty  introduced  by  Fan  (1999), 

m  =  i{e<A)  +  ^±i(e>x), 

for  9  >  0  and  a  >  2  (usually  a  =  3.7),  wich  leads  to 


sgn(z/)(|zy|  -  A)+ 

(a—l)zj—a\Sgn(zj) 
a— 2 


v 


when 


7 


<  2A 


when  2 A  <  \zA  <  aX 
when  \zA  >  aX 


which  is  concave  on  [0,  oo)  and  does  not  intend  to  over 
penalize  large  1 9 
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Illustration 


(b)  Hard  thresholding 


(c)  Smoothly  clipped  L  penalty 


(d)  Transformed  L  penalty 


0  0 


Plot  of  penalty  functions 
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Thresholding  functions 


(c)  Smoothly  clipped  L1  penalty  (d)Transformed  L1  penalty 


Plot  of  thresholding  functions 
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Performance  of  thresholding  rules 


To  compare  the  thresholding  rules  we  apply  the  tool  of 
exact  risk  analysis.  The  closed  forms  for  the  L2  risk 

functions  R(9,  A,  a2)  —  M(9  —  9)2,  for  the  hard  and 
soft-thresholding  rules  have  been  derived  by  Donoho 
and  Johnstone  (1994).  It  is  easy  to  show  that 

R(9,  \,<j2)  =  <t2R(9 /<j,A/<j,1) 

For  simplicity  we  denote  by  R(9,  A)  the  risk  function  for 
a  —  1 .  If  interested,  you  can  find  the  expressions  of  the 
risk  for  most  of  the  penalties  reviewed  above  in  Anto- 
niadis  and  Fan  (2001)  when  Z  ~  N{9, 1).  5 


Risk  functions  under  quadratic  loss 


Risk  functions  of  three  thresholding  rules 

To  make  the  scale  roughly  comparable,  we  took  A  =  2  for 
the  hard- thresholding  rule  and  adjusted  the  values  of  A 

for  the  other  rules  so  that  their  estimated  values  are  the 
same  when  6  —  3. 
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Choosing  A 


The  performance  of  the  penalized  least-squares  estimator 
depends  on  the  regularization  parameter  A.  Again  a 
convenient  way  to  get  a  data  based  estimate  of  A  is  by 
using  generalized  cross  validation.  However  the 
minimization  problem  we  are  dealing  with  is  not 
quadratic  and  it  is  not  obvious  how  such  a  method  may 
be  applied. 

Tibshirani  (1996)  proposed  a  GCV-type  criterion  for 
choosing  the  tuning  parameter  for  the  LASSO  through  a 
ridge  estimate  approximation.  Of  course,  this 
approximation  ignores  some  variability  in  the  estimation 
process  but  the  simulation  study  in  Tibshirani  (1996) 
suggests  that  it  is  a  useful  approximation. 
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Choosing  A-cont. 


The  smoothly  clipped  Lj  penalty  p\  is  not  differentiable. 
However  it  can  be  locally  approximated  by  a  quadratic 
function  as  follows.  Let  0q  be  a  given  initial  value  that  is 
close  to  the  solution  of  the  minimization  problem.  When 
|0o|  >  0,  we  use 

Pxi\0\)  -  Pa(I^oI)  +  ^P\(\Oo\){02  -  0§),  0-0o 

and  0  if  0o  =  0.  The  figure  that  follows  shows  the  above 
approximations  for  two  different  values  of  0q  . 
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Quadratic  approximation 


(c)  SCAD  penalty 


The  penalty  and  its  quadratic  approximations 

Setting  the  E^(0)  =  diag i  e  In)  the  solution  can 
be  found  iteratively  computing  a  ridge  regression 
problem  at  each  iteration.  At  convergence,  one  then  uses 
the  usual  GCV  criterion  for  ridge  regression  type 
problems. 
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Asymptotic  properties 


Assume  that  the  signal  7/  is  in  a  Besov  ball.  Because  of 
simple  characterization  of  this  space  via  the  wavelet 
coefficients  of  its  members,  the  Besov  space  ball  £>JA(C) 

can  be  defined  as 


/e  VE(2/(t,+1/2-1/p)ll0;- 
; 


<o. 


where  0,.  is  the  vector  of  wavelet  coefficients  at  the  reso¬ 
lution  level  j.  Here,  v  indicates  the  degree  of  smoothness 
of  the  underlying  signal  rj. 


.-p.51/55 


Note  that  the  wavelet  coefficients  6  in  the  definition  of 
the  Besov  space  are  continuous  wavelet  coefficients.  They 
are  approximately  a  factor  of  ft1/2  larger  than  the  discrete 
wavelet  coefficients  Wfj.  This  is  equivalent  to  assume  that 
the  noise  level  is  of  order  1/n. 
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Suppose  that  the  penalty  function  p\(-)  is  a  nonnegative, 
nondecreasing  and  differentiable  function  in  (0,  oo).  Fur¬ 
ther,  assume  that  the  function  —0  —  p\(9)  is  strictly  uni- 
modal  on  (0,  oo)  and  that  p\  (0+)  >  0.  Assume  also  that 
v  +  1/2  —  1/p  >  0.  Then,  the  maximum  risk  of  the  penal¬ 
ized  least-squares  estimator  fj\  over  the  Besov  ball  £>!)  „  (C) 
is  of  rate  0(n~2v^2v+1'>  log  n )  when  the  universal  thresh¬ 
olding  log  n  is  used. 
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The  overheads  of  this  lecture  are  available  at: 

http:  /  /  www-lmc.imag.fr/lmc-sms  /  Anestis.Antoniadis  / 

Software  in  Matlab  implementing  theses  methods  (and 
many  more)  has  been  developed  by  Antoniadis,  Bigot 
and  Sapatinas  and  is  available  at 

http:  /  /  www-lmc.imag.fr/ SMS/ software/ wavden 

Thank  you  for  your  attention. 
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